Zurich - 27 & 28 June 2022

Iris Eekhout

PhD project Don’t Miss Out! | www.missingdata.nl

Personal website: www.iriseekhout.com

GitHub: iriseekhout

Interests:

  • Missing data methods (multiple imputation)
  • Longitudinal (multilevel) modelling
  • R programming/shiny apps
  • Psychometrics and measurement

Missing data analyses

Workshop info

  • Missing data analysis

  • Missing data methods

  • Missing data in practice

  • 2-day workshop

Program

Day 1

Morning:

  • Lecture 1: Missing data analyses

    • What are missing data, why do we have missing data?
    • What kind of missing data are there?
      • What are missing data mechanisms?
      • How can we investigate our missing data?
  • Practical 1: Missing data analyses

Afternoon:

  • Lecture 2: Dealing with missing data

    • Ad hoc solutions and what’s wrong with that?
    • Multiple imputation - how does it work
    • Overview of imputation methods
  • Software demo: How to perform multiple imputation with MICE

  • Practical 2: Dealing with missing data

Day 2

Morning:

  • Lecture 3: Missing data in practice

    • Full information maximum likelihood
    • Missing data in questionnaires
    • Missing data in longitudinal designs
  • Practical 3: Missing data in practice

    • Discuss practical issues of participants.

Suggested literature

Most important R packages

For missing data handling:

  • mice: multiple imputation by chained equations
  • miceadds: additional multiple imputation functions, especially for ‘mice’

Other R packages

  • dplyr: for data manipulation, summarizing and grouping
  • tidyr: tools for shaping and structuring data.

Missing value analysis

What are missing values

In R, missing observations are represented as NA.

Missing data can have different implications for data summaries, analyses and conclusions based on the data with missing values.
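As a small illustration of how NA values affect data summaries, the sketch below uses a made-up vector x (not part of the workshop data). Most base R summary functions return NA when any value is missing, unless the missing values are explicitly removed.

```r
# Hypothetical toy vector with two missing observations
x <- c(2, 4, NA, 8, NA, 10)

is.na(x)                 # logical indicator: TRUE where the value is missing
n_missing <- sum(is.na(x))   # number of missing observations

mean_all <- mean(x)                 # NA: the missing values propagate
mean_obs <- mean(x, na.rm = TRUE)   # mean of the four observed values only
```

Removing the NA values with na.rm = TRUE silently changes the sample on which the summary is based, which is why the amount and type of missing data should be inspected first.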

Amount of missing data

  • Matrix perspective
  • Variable perspective
  • Case perspective

Example data

The example data has 25 rows and 5 columns.

head(datm, 15)
>            X1         X2          X3          X4          X5
> 1  -1.4568301         NA -1.91315289 -1.42624613          NA
> 2  -0.3893474  0.4363230 -0.19530999 -0.42230577  0.64451958
> 3   0.2503727 -0.5855374 -0.64329498 -0.13405753  0.46480082
> 4          NA  1.0630725  0.59973864          NA          NA
> 5   0.1470515  1.2481788 -0.06758740 -0.12847872 -0.23417613
> 6          NA  1.9420598  0.48763912          NA          NA
> 7  -0.5864552  0.4723545  0.76641348  0.34962006 -0.92113775
> 8  -2.1632037         NA -2.03224736 -1.81816998          NA
> 9  -3.1243917         NA -1.71775149 -2.17786279          NA
> 10  0.8641733  0.7684050  0.43646038  1.26555557  1.71371478
> 11 -0.1621988 -0.7730129 -0.35266991 -0.15216926  0.84706787
> 12 -0.1226546  0.7488797 -0.39710003  1.00688563 -0.07092473
> 13         NA -1.8994749 -1.96920043          NA          NA
> 14  0.2431816         NA  0.04794798 -0.01728637          NA
> 15  1.4608764         NA  0.92040707  1.04589393          NA

Matrix perspective

Matrix perspective: the number of missing entries in the data matrix.

The is.na function returns TRUE if a cell is missing (NA) and FALSE if a cell is observed.

In the example there are 24 missing data entries. The data frame contains 5 variables for 25 subjects, which makes a total of 125 data entries. So, 19.2% of the data entries are missing.

sum(is.na(datm))
> [1] 24
sum(is.na(datm))/length(is.na(datm))
> [1] 0.192

Variable perspective

Variable perspective: the number of missing values per variable.

For each variable we can count the number of missing observations (n) and calculate the proportion (p).

library(dplyr)   # %>%, summarise_all
library(tidyr)   # pivot_longer

datm %>%
  is.na %>%
  data.frame() %>%
  summarise_all(list(n = sum, p = mean)) %>%
  pivot_longer(everything(), 
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_(.)")
> # A tibble: 5 x 3
>   variable     n     p
>   <chr>    <int> <dbl>
> 1 X1           4  0.16
> 2 X2           6  0.24
> 3 X3           0  0   
> 4 X4           4  0.16
> 5 X5          10  0.4

Case perspective

Case perspective: the number of rows, i.e. cases, with missing values.

Many analysis methods only use the rows that are fully observed: complete-case analysis.

The data are then listwise deleted.

datm %>% 
  is.na %>%
  data.frame() %>%
  mutate(n_miss = rowSums(.),
         missing = ifelse(n_miss > 0, "rows with missing", "rows without missing")) %>%
  group_by(missing) %>%
  summarise(n = n(),
            p = n / 25)
> # A tibble: 2 x 3
>   missing                  n     p
>   <chr>                <int> <dbl>
> 1 rows with missing       10   0.4
> 2 rows without missing    15   0.6

Case perspective - mice

  • cci: complete case indicator, TRUE for each fully observed row and FALSE otherwise.
mice::cci(datm)
>  [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
> [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
> [25]  TRUE
  • nic: count the number of incomplete cases, i.e. cases with missing values.
mice::nic(datm)
> [1] 10
  • ncc: count the number of complete cases, i.e. fully observed rows.
mice::ncc(datm)
> [1] 15
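The same case-level counts can be sketched in base R with complete.cases() and na.omit(). The data frame toy below is a made-up example, not the workshop's datm.

```r
# Hypothetical toy data frame with missing values in columns a and b
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(NA, NA, 1, 2, 3),
  c = c(10, 20, 30, 40, 50)
)

cc <- complete.cases(toy)    # TRUE for fully observed rows, comparable to mice::cci()
n_complete   <- sum(cc)      # comparable to mice::ncc()
n_incomplete <- sum(!cc)     # comparable to mice::nic()

toy_cc <- na.omit(toy)       # listwise deletion: keeps the complete cases only
```

Here rows 1 and 2 are incomplete, so listwise deletion leaves three rows for analysis.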

Missing data patterns

Missing data pattern: the combination of observed and unobserved values that occur together in a row. A pattern is usually coded with a 0 for a missing value and a 1 for an observed value.

Data often contains multiple different missing data patterns. The example shows three missing data patterns:

  1. All variables are observed, so a row of only ones.
  2. Three variables observed and two missing.
  3. Two variables observed and three missing.
mice::md.pattern(datm, plot = FALSE)
>    X3 X1 X4 X2 X5   
> 15  1  1  1  1  1  0
> 6   1  1  1  0  0  2
> 4   1  0  0  1  0  3
>     0  4  4  6 10 24

Row names: the number of times the pattern occurs in the data. Last column: the number of missing values in the pattern. Last row: the number of missing values per variable.
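To make the idea of a pattern concrete, the following base R sketch builds the 0/1 response indicator and counts how often each row pattern occurs. It uses a made-up data frame toy and is only a rough illustration of the idea, not how md.pattern is implemented.

```r
# Hypothetical toy data frame
toy <- data.frame(
  a = c(1, NA, 3, 4, NA, 6),
  b = c(NA, NA, 1, 2, NA, 3),
  c = c(10, 20, 30, 40, 50, 60)
)

# Response indicator matrix: 1 = observed, 0 = missing
r <- (!is.na(toy)) * 1

# Collapse each row into a pattern string such as "1 0 1" and count occurrences
pattern <- apply(r, 1, paste, collapse = " ")
pattern_counts <- table(pattern)
pattern_counts
```

In this toy example three rows are fully observed ("1 1 1"), two rows miss both a and b ("0 0 1"), and one row misses only b ("1 0 1").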

Missing data pairs

Missing data pair: the number of times two variables are either missing together or observed together.

Missing data pairs show how many cases we can actually use for imputation. The md.pairs function from the mice package returns four matrices. Each matrix gives us information about combinations of missing values in our data.

  • response-response (rr) the count of how often two variables are both observed.
  • response-missing (rm) the count of how often the row-variable is observed and the column-variable is missing.
  • missing-response (mr) the count of how often the row-variable is missing and the column-variable is observed.
  • missing-missing (mm) the count of how often two variables are both missing.
pat <- mice::md.pairs(datm)

Response-response

Observed value counts.

pat$rr
>    X1 X2 X3 X4 X5
> X1 21 15 21 21 15
> X2 15 19 19 15 15
> X3 21 19 25 21 15
> X4 21 15 21 21 15
> X5 15 15 15 15 15

Response-missing

Counts of cases where the row variable is observed and the column variable is missing.

pat$rm
>    X1 X2 X3 X4 X5
> X1  0  6  0  0  6
> X2  4  0  0  4  4
> X3  4  6  0  4 10
> X4  0  6  0  0  6
> X5  0  0  0  0  0

Missing-response

Counts of cases where the row variable is missing and the column variable is observed.

pat$mr
>    X1 X2 X3 X4 X5
> X1  0  4  4  0  0
> X2  6  0  6  6  0
> X3  0  0  0  0  0
> X4  0  4  4  0  0
> X5  6  4 10  6  0

Missing-missing

Missing value counts.

pat$mm
>    X1 X2 X3 X4 X5
> X1  4  0  0  4  4
> X2  0  6  0  0  6
> X3  0  0  0  0  0
> X4  4  0  0  4  4
> X5  4  6  0  4 10

Information for imputation

The missing-response counts, as a percentage of the sum of the missing-response and missing-missing counts, show in how many of the cases with a missing row variable the column variable is observed. These are the usable cases for imputing the row variable from the column variable.

round(100 * pat$mr / (pat$mr + pat$mm))
>     X1  X2  X3  X4  X5
> X1   0 100 100   0   0
> X2 100   0 100 100   0
> X3 NaN NaN NaN NaN NaN
> X4   0 100 100   0   0
> X5  60  40 100  60   0

X3 has no missing values, so its row is 0/0 and prints as NaN.
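The four pair matrices and the percentage of usable cases can be sketched with plain matrix products on the response indicator. The data frame toy and the object names below are made up for this illustration.

```r
# Hypothetical toy data frame
toy <- data.frame(
  a = c(1, NA, 3, 4, NA, 6),
  b = c(NA, NA, 1, 2, NA, 3),
  c = c(10, 20, 30, 40, 50, 60)
)

r <- !is.na(toy)   # TRUE = observed
m <- is.na(toy)    # TRUE = missing

# Entry [i, j] of t(x) %*% y counts the rows where x holds for
# variable i and y holds for variable j.
rr  <- t(r) %*% r   # both observed
rm_ <- t(r) %*% m   # row variable observed, column variable missing
mr  <- t(m) %*% r   # row variable missing, column variable observed
mm  <- t(m) %*% m   # both missing

# Percentage of usable cases for imputing the row variable from the column variable
usable <- round(100 * mr / (mr + mm))
usable
```

In the toy data, b is missing in three rows and a is observed in only one of them, so a offers 33% usable cases for imputing b, whereas the fully observed c offers 100%.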

Missing data mechanisms

Types of missing data

In research, missing data occur when a data value is unavailable. Many empirical studies encounter missing data. Missing data can occur in many stages of research due to many different causes in many different forms.

  • Non-response: an invited respondent does not participate in the study.
  • Intermittent missing data: missing data on one or more of the measured variables that are used as a predictor, covariate or outcome.
  • Drop-out or loss to follow-up: participants in a longitudinal study do not show up at one or more repeated measurement occasions.

Each type of missing data may have different reasons, and also different implications for the methods to deal with the missing data.

Missing data mechanisms

The underlying causes of missing data are known as missing data mechanisms and were first described by Rubin (1976).

Rubin distinguished three missing data mechanisms:

  • missing completely at random (MCAR)
  • missing at random (MAR)
  • missing not at random (MNAR)

Missing completely at random (MCAR)

Missing data are MCAR when the probability of missing data on a variable is unrelated to any other measured variable and is unrelated to the variable with missing values itself.

In other words the missingness on the variable is completely unsystematic.
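A minimal sketch of how MCAR data could be simulated. The slides do not show how the example data were generated, so the variables, the seed, and the 50% deletion probability below are assumptions for illustration only.

```r
set.seed(2022)                       # arbitrary seed, for reproducibility only
n  <- 1000
X2 <- rnorm(n, sd = 2)
X1 <- 0.5 * X2 + rnorm(n, sd = 2)    # X1 is correlated with X2

# MCAR: every value of X1 has the same 50% chance of being deleted,
# independent of X1, X2, or anything else.
miss    <- rbinom(n, size = 1, prob = 0.5) == 1
X1_mcar <- ifelse(miss, NA, X1)

prop_missing <- mean(is.na(X1_mcar))
means <- c(full = mean(X1), observed = mean(X1_mcar, na.rm = TRUE))
means
```

Because the deletion is unrelated to any variable, the mean of the observed values should stay close to the mean of the full data, apart from sampling noise.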


Data example

Below is a description of the complete example data. We will use this example to show the implications of each missing data mechanism.

Description of fully observed data
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000  0.00 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07
X2    2 1000  0.09 2.26   0.14    0.12 2.24 -6.73 6.58 13.31 -0.12    -0.15 0.07
X3    3 1000 -0.03 2.25   0.02   -0.04 2.27 -7.27 6.54 13.81  0.00    -0.22 0.07

MCAR data example

When we create MCAR data for 50% of the subjects on variable X1, we see that the statistics for variable X1 have not changed much:

Description of fully observed data
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000     0 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07

Description of MCAR data - 50%
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1  490  0.01  2.1   0.02    0.04 2.14 -7.38 5.58 12.96 -0.18     -0.1 0.09

MCAR distribution

Probability of MCAR

We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.

mcar <- mcar %>% mutate(R1 = is.na(X1))

Missing at random (MAR)

Missing data are MAR when the probability of missing data on a variable is related to some other measured variable in the model, but not to the value of the variable with missing values itself.

For example, older people more often have missing values for IQ. In that case the probability of missing data on IQ is related to age.
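A minimal sketch of how MAR data could be simulated. The logistic model for the missingness probability and its coefficient of 0.8 are assumptions for illustration; the slides do not show how the example data were generated.

```r
set.seed(2022)                       # arbitrary seed, for reproducibility only
n  <- 1000
X2 <- rnorm(n, sd = 2)
X1 <- 0.5 * X2 + rnorm(n, sd = 2)    # X1 is correlated with X2

# MAR: the probability that X1 is missing depends only on the observed X2.
p_miss <- plogis(0.8 * X2)           # higher X2 -> X1 more likely missing
miss   <- rbinom(n, size = 1, prob = p_miss) == 1
X1_mar <- ifelse(miss, NA, X1)

# The observed mean of X1 shifts, because high-X2 (and thus high-X1) cases drop out
means <- c(full = mean(X1), observed = mean(X1_mar, na.rm = TRUE))

# Missingness is visibly related to the observed variable X2
X2_by_missing <- tapply(X2, is.na(X1_mar), mean)
```

Unlike MCAR, the group with missing X1 differs from the group with observed X1 on X2, which is something that can be checked in the observed data.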


MAR data example

When we create MAR data for 50% of the subjects on variable X1, we see that the statistics for variable X1 have changed:

Description of fully observed data
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000     0 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07

Description of MAR data - 50%
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1  512 -0.28 2.25  -0.27   -0.26  2.3 -7.42 5.34 12.76 -0.13    -0.25  0.1

MAR distribution

Probability of MAR

We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.

mar <- mar %>% mutate(R1 = is.na(X1))

The difference between the group with missing values (TRUE) and the group without missing values (FALSE) shows that having missing data is related to the scores on the other variables.

Missing not at random (MNAR)

Data are MNAR when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.

For example, weight data are MNAR when they are missing mostly for heavier persons.
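A minimal sketch of how MNAR data could be simulated. As before, the logistic model and the coefficient of 0.8 are assumptions for illustration and not the workshop's own simulation.

```r
set.seed(2022)                       # arbitrary seed, for reproducibility only
n  <- 1000
X2 <- rnorm(n, sd = 2)
X1 <- 0.5 * X2 + rnorm(n, sd = 2)    # X1 is correlated with X2

# MNAR: the probability that X1 is missing depends on the value of X1 itself.
p_miss  <- plogis(0.8 * X1)          # higher X1 -> X1 more likely missing
miss    <- rbinom(n, size = 1, prob = p_miss) == 1
X1_mnar <- ifelse(miss, NA, X1)

# The observed mean of X1 is biased downward, because high values are deleted
means <- c(full = mean(X1), observed = mean(X1_mnar, na.rm = TRUE))
means
```

In real data the comparison with the full mean is impossible, because the deleted values of X1 are unknown. That is why MNAR cannot be verified from the observed data alone.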


MNAR data example

When we create MNAR data for 50% of the subjects on variable X1, we see that the statistics for variable X1 have changed:

Description of fully observed data
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000     0 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07

Description of MNAR data - 50%
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1  488 -1.12 1.96  -1.17   -1.18 1.53 -7.42 5.11 12.53  0.26     0.73 0.09

MNAR distribution

Probability of MNAR

We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.

mnar <- mnar %>% mutate(R1 = is.na(X1))

The difference between the group with missing values (TRUE) and the group without missing values (FALSE) shows that having missing data is related to the scores on the other variables.

Evaluate the missing data mechanism

Reason for missing data

Any information about the research process can help to evaluate and make assumptions about the missing data mechanism.

Why are data missing?


Missing data mechanisms

  • Missing completely at random: the observed data are a completely random subsample of the complete data.
  • Missing at random: probability of missing data is related to other measured variables.
  • Missing not at random: probability of missing data is related to the unobserved values themselves, and possibly also to other measured variables.

Testing the mechanisms

The missing data mechanisms are defined by the probability that missing data occur.

Probability is not related to other measured variables

  • Assume the remaining sample is a totally random subsample (MCAR).

Other measured variables are related to the probability of missing data

  • Assume the data are not MCAR. However, we cannot definitively rule out MNAR, because in practice we never know the missing values themselves.

Statistical tests

The essence of testing for MCAR is to compare the group with missing data to the group without missing data.

Univariate testing

  • Independent samples t-test for continuous measures
  • Chi-square test for categorical measures

Multivariate testing

  • Logistic regression to evaluate multivariately
  • Little’s MCAR test

T-test to evaluate MCAR

Independent samples T-test to compare the mean of continuous variables between the group with missing data and the group without missing data.

Note that the classical T-test assumes normally distributed data and homogeneity of variance. By default, t.test in R applies the Welch correction, which relaxes the equal-variance assumption.

T-test

MCAR example

t.test(X2 ~ R1, data = mcar)
> 
>   Welch Two Sample t-test
> 
> data:  X2 by R1
> t = -0.11866, df = 996.42, p-value = 0.9056
> alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
> 95 percent confidence interval:
>  -0.2970947  0.2632135
> sample estimates:
> mean in group FALSE  mean in group TRUE 
>          0.07756972          0.09451031

T-test

MAR example

t.test(X2 ~ R1, data = mar)
> 
>   Welch Two Sample t-test
> 
> data:  X2 by R1
> t = -13.814, df = 994.06, p-value < 2.2e-16
> alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
> 95 percent confidence interval:
>  -2.064496 -1.550917
> sample estimates:
> mean in group FALSE  mean in group TRUE 
>          -0.7959512           1.0117550

T-test

MNAR example

t.test(X2 ~ R1, data = mnar)
> 
>   Welch Two Sample t-test
> 
> data:  X2 by R1
> t = -3.9996, df = 973.87, p-value = 6.824e-05
> alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
> 95 percent confidence interval:
>  -0.8467042 -0.2893187
> sample estimates:
> mean in group FALSE  mean in group TRUE 
>          -0.2046124           0.3633990

T-test

Univariate method.

When there are no significant differences we may assume the data are MCAR. Otherwise, we assume not-MCAR (i.e. MAR or MNAR).

Note that we can never truly rule out MNAR.

Chi-square test to evaluate MCAR

Chi-square test to compare the categorical variables for the group with missing data to the group without missing data.

Test to compare the distribution over the categories between the groups.

Note that the Chi-square test requires that the expected cell frequencies are not too small.

Chi-square

MCAR example

mcar <- mcar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mcar$R1, mcar$X3c)
> 
>   Pearson's Chi-squared test with Yates' continuity correction
> 
> data:  mcar$R1 and mcar$X3c
> X-squared = 0.062121, df = 1, p-value = 0.8032

Chi-square

MAR example

mar <- mar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mar$R1, mar$X3c)
> 
>   Pearson's Chi-squared test with Yates' continuity correction
> 
> data:  mar$R1 and mar$X3c
> X-squared = 80.787, df = 1, p-value < 2.2e-16

Chi-square

MNAR example

mnar <- mnar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mnar$R1, mnar$X3c)

The MNAR output is not shown: the result printed for this slide (X-squared = 80.787, df = 1) was computed on the MAR data and is identical to the MAR example.

Logistic regression to evaluate MCAR

The probability of missing data can also be investigated in a logistic regression analysis.

The missing data indicator is the dependent variable and the other variables that may be related to the probability of missing data are the independent variables.

Test for multiple independent variables at once.

The results of the logistic regression analysis show if the independent variables relate to the probability of missing data.

Note that when the other variables have missing values as well, a complete-case analysis is used by default.
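A minimal sketch of such a logistic regression on simulated data. The data-generating model below is an assumption for illustration (X1 is made MAR given X2), and the call passes family = binomial explicitly, because glm otherwise defaults to a Gaussian family.

```r
set.seed(2022)                       # arbitrary seed, for reproducibility only
n  <- 1000
X2 <- rnorm(n, sd = 2)
X3 <- rnorm(n, sd = 2)               # unrelated to the missingness
X1 <- 0.5 * X2 + rnorm(n, sd = 2)

# Hypothetical MAR mechanism: missingness of X1 depends on X2 only
X1[rbinom(n, size = 1, prob = plogis(0.8 * X2)) == 1] <- NA

dat    <- data.frame(X1, X2, X3)
dat$R1 <- as.integer(is.na(dat$X1))  # missing data indicator: 1 = X1 missing

# Logistic regression of the missing data indicator on the other variables
fit   <- glm(R1 ~ X2 + X3, family = binomial, data = dat)
coefs <- coef(summary(fit))
round(coefs, 3)                      # estimates are on the log-odds scale
```

With family = binomial the coefficient table reports z values and Pr(>|z|), and exp(coef(fit)) gives odds ratios for the probability of missing data.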

Logistic regression

MCAR example

glm(R1 ~ X2 + X3, data = mcar) %>%
  summary %>% coefficients %>% round(.,3)
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)    0.510      0.016  32.171    0.000
> X2             0.002      0.007   0.277    0.782
> X3            -0.006      0.007  -0.870    0.384

Logistic regression

MAR example

glm(R1 ~ X2 + X3, data = mar) %>%
  summary %>% coefficients %>% round(.,3)
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)    0.483      0.014  34.573        0
> X2             0.078      0.006  12.460        0
> X3             0.056      0.006   8.909        0

Logistic regression

MNAR example

glm(R1 ~ X2 + X3, data = mnar) %>%
  summary %>% coefficients %>% round(.,3)
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)    0.483      0.014  34.573        0
> X2             0.078      0.006  12.460        0
> X3             0.056      0.006   8.909        0

Note that these estimates are identical to those of the MAR example, which suggests that this output was computed on the MAR data; the MNAR data should be re-analysed to obtain the MNAR estimates.

Logistic regression

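In the MCAR example neither X2 nor X3 is related to the probability of missing data in X1, so we may assume that the missing data in X1 are MCAR.

However, in the MAR and MNAR examples, both variables are related to the probability of missing data in X1, so in that case we can assume that the data are not-MCAR.

We cannot differentiate between MAR and MNAR in this situation, since we cannot test against the missing values themselves.

Note that the glm calls above do not set family = binomial, so R fits a linear probability model with the default Gaussian family (hence the t values in the output) rather than a logistic regression.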

Little’s MCAR test

  • A multivariate test that evaluates the subgroups of the data that share the same missing data pattern.

  • Per subgroup (with same missing data pattern): observed means versus estimated means based on the expectation-maximization algorithm.

  • Chi-square distribution test to test the null hypothesis that data are MCAR.

  • A significant result shows that the data are not-MCAR.

Little’s MCAR test

MCAR example

misty::na.test(mcar %>% select(X1:X3))
>  Little's MCAR Test
> 
>      n nIncomp nPattern chi2 df  pval 
>   1000     510        2 0.77  2 0.679

Little’s MCAR test

MAR example

misty::na.test(mar %>% select(X1:X3))
>  Little's MCAR Test
> 
>      n nIncomp nPattern   chi2 df  pval 
>   1000     488        2 222.52  2 0.000

Little’s MCAR test

MNAR example

misty::na.test(mnar %>% select(X1:X3))
>  Little's MCAR Test
> 
>      n nIncomp nPattern   chi2 df  pval 
>   1000     488        2 222.52  2 0.000

Note that this output is identical to the MAR example, including the 488 incomplete cases, whereas the MNAR data have 512 missing values on X1. This suggests that the test was run on the MAR data.

Little’s MCAR test notes

  • No specific information about which variables are related to the probability of missing data.

  • Test assumes multivariate normality and can only be applied to continuous variables.

  • The MNAR mechanism can never be ruled out, regardless of the result of the test.

Assuming the missing data mechanism (MCAR)

The methods to deal with missing data implicitly assume a missing data mechanism.

MCAR: the strictest assumption. In practice it is also easiest to deal with MCAR data.

  1. Analyze the observed sample only (this will result in unbiased estimates).
  2. Use an imputation method to boost the power when the amount of missing data is too large.

Assuming the missing data mechanism (MAR)

MAR: less strict assumption. Most advanced missing data methods assume this mechanism (e.g. multiple imputation, FIML).

  • When variables that may explain the missing data are included in the study, the MAR assumption becomes more plausible (as compared to MNAR).
  • These auxiliary variables may also help in dealing with the missing data.
  • Auxiliary variables: variables related to the probability of missing data or to the variable with missing data.
    • Can be used as predictors in an imputation model or as covariates in a FIML model to improve estimations.

Assuming the missing data mechanism (MNAR)

MNAR: least strict assumption.

  • MNAR data are also referred to as non-ignorable, because these cannot be ignored without causing bias in results. MNAR data are more challenging to deal with.